Middle-end solution to robust speech recognition

ABSTRACT

A method for performing time and frequency Signal-to-Noise Ratio (SNR) dependent weighting in speech recognition is described that includes, for each period t: estimating the SNR to get time and frequency SNR information η_(t,f); calculating the time and frequency weighting to get γ_(t,f); performing the back and forth weighted time varying DCT transformation matrix computation MG_(t)M⁻¹ to get T_(t); providing the transformation matrix computation T_(t) and the original MFCC feature o_(t) that contains the information about the SNR to a recognizer including the Viterbi decoding; and performing weighted Viterbi recognition b_(j)(o_(t)).

FIELD OF INVENTION

This invention relates to speech recognition and more particularly to Signal-to-Noise Ratio (SNR) dependent decoding and weighted Viterbi recognition.

BACKGROUND OF INVENTION

A technique of time-varying SNR dependent coding for increased communication channel robustness is described by A. Bernard, one of the inventors herein, and A. Alwan in "Joint Channel Decoding-Viterbi Recognition for Wireless Applications," in Proceedings of Eurospeech, Sept. 2001, vol. 4, pp. 2703-6; by A. Bernard, X. Liu, R. Wesel and A. Alwan in "Speech Transmission Using Rate-Compatible Trellis Codes and Embedded Source Coding," IEEE Transactions on Communications, vol. 50, no. 2, pp. 309-320, Feb. 2002; and by A. Bernard in "Source and Channel Coding for Speech and Remote Speech Recognition," Ph.D. thesis, University of California, Los Angeles, 2001.

A technique for channel and acoustic robustness is described by X. Cui, A. Bernard, and A. Alwan in "A Noise-robust ASR Back-end Technique Based on Weighted Viterbi Recognition," in Proceedings of Eurospeech, September 2003, pp. 2169-72.

Speech recognizers compare the incoming speech to speech models such as Hidden Markov Models (HMMs) to identify or recognize speech. Typical speech recognizers combine the likelihoods of the recognition features of each speech frame with equal importance to provide the overall likelihood of observing the sequence of feature vectors. Robustness in speech recognition is typically dealt with either at the front end (by cleaning up the features) or at the back end (by adapting the acoustic model to the particular acoustic noise and channel environment).

Such classic recognizers fail to differentiate between the particular importance of each individual frame, which can significantly reduce recognition performance in cases where the importance of each frame could be quantitatively estimated and incorporated into a weighted recognition mechanism.

SUMMARY OF INVENTION

In accordance with one embodiment of the present invention, a procedure is provided for performing speech recognition which can integrate, besides the usual speech recognition feature vector, information regarding the importance of each feature vector (or even of each frequency band within the feature vector). Applicant's solution leaves both the acoustic features and models intact and only modifies the weighting formula used in the combination of the individual frame likelihoods.

In accordance with an embodiment of the present invention a method for performing time and frequency SNR dependent weighting in speech recognition includes, for each period t: estimating the SNR to get time and frequency SNR information η_(t,f); calculating the time and frequency weighting to get γ_(t,f); performing the back and forth weighted time varying DCT transformation matrix computation MG_(t)M⁻¹ to get T_(t); providing the transformation matrix computation T_(t) and the original MFCC feature o_(t) that contains the information about the SNR to a recognizer including the Viterbi decoding; and performing weighted Viterbi recognition b_(j)(o_(t)).

DESCRIPTION OF DRAWING

FIG. 1 is an illustration of the Viterbi algorithm for HMM speech recognition where the vertical dimension represents the state and the horizontal dimension represents the frames of speech (i.e. time).

FIG. 2 is a block diagram of time and frequency SNR dependent weighted Viterbi recognition.

FIG. 3 illustrates the performance of the t-WVR back-end on the Aurora-2 database for different SNRs.

DESCRIPTION OF PREFERRED EMBODIMENT

Review of Time Weighted Viterbi Recognition

In general, there are two related approaches to solve the temporal alignment problem in HMM speech recognition. The first is the application of dynamic programming or Viterbi decoding, and the second is the more general forward/backward algorithm. The Viterbi algorithm (essentially the same algorithm as the forward probability calculation except that the summation is replaced by a maximum operation) is typically used for segmentation and recognition, and the forward/backward algorithm for training. For the Viterbi algorithm see G. D. Forney, "The Viterbi Algorithm," Proceedings of the IEEE, vol. 61, no. 3, pp. 268-278, March 1973.

The Viterbi algorithm finds the state sequence Q that maximizes the probability P* of observing the feature sequence O = o₁, . . . , o_T given the acoustic model λ:

$$P^{*} = \max_{\mathrm{all}\ Q} P(Q, O \mid \lambda). \qquad (1)$$

In order to calculate P* for a given model λ, we define the metric φ_(j)(t), which represents the maximum likelihood of observing the feature sequence o₁, . . . , o_t given that we are in state j at time t. Based on dynamic programming, this partial likelihood can be computed efficiently using the following recursion:

$$\varphi_{j}(t) = \max_{i}\left\{ \varphi_{i}(t-1)\, a_{ij} \right\} b_{j}(o_{t}). \qquad (2)$$

The maximum likelihood P*(O|λ) is then given by P*(O|λ) = max_(j){φ_(j)(T)}.

The recursion (2) forms the basis of the Viterbi Algorithm (VA), whose idea is that there is only one "best" path to state j at time t.

As shown in FIG. 1, this algorithm can be visualized as finding the best path through a trellis where the vertical dimension represents the states of the HMM and the horizontal dimension represents the frames of speech (i.e. time).
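
For illustration only, a minimal NumPy sketch of recursion (2) carried out in the log domain is given below; the argument names and array shapes (`log_b` for the per-frame, per-state log observation likelihoods, `log_a` for the log transition probabilities, `log_pi` for the log initial state probabilities) are assumptions made for this example and are not taken from the cited references.

```python
import numpy as np

def viterbi(log_b, log_a, log_pi):
    """Standard Viterbi recursion of equation (2), carried out in the log domain.

    log_b  : (T, N) array of log b_j(o_t)
    log_a  : (N, N) array of log a_ij
    log_pi : (N,)   array of log initial state probabilities
    Returns log P*(O | lambda) and the best state sequence.
    """
    T, N = log_b.shape
    phi = np.full((T, N), -np.inf)       # partial log likelihoods phi_j(t)
    back = np.zeros((T, N), dtype=int)   # back-pointers for the single best path
    phi[0] = log_pi + log_b[0]
    for t in range(1, T):
        # phi_j(t) = max_i { phi_i(t-1) + log a_ij } + log b_j(o_t)
        scores = phi[t - 1][:, None] + log_a
        back[t] = np.argmax(scores, axis=0)
        phi[t] = np.max(scores, axis=0) + log_b[t]
    best_last = int(np.argmax(phi[-1]))  # log P* = max_j phi_j(T)
    path = [best_last]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return float(phi[-1, best_last]), path[::-1]
```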

Time Weighted Viterbi Recognition (WVR)

In speech recognition, the quality of speech features can depend on many factors: acoustic noise, microphone quality, quality of communication, etc. The weighted Viterbi recognizer (WVR), presented in "Joint Channel Decoding-Viterbi Recognition for Wireless Applications," cited above, modifies the Viterbi algorithm (VA) to take into account the quality of the feature.

The time-varying quality γ_(t) of the feature vector at time t is inserted in the VA by raising the probability b_(j)(o_(t)) to the power γ_(t) to obtain the following state metric update equation:

$$\varphi_{j,t} = \max_{i}\left[ \varphi_{i,t-1}\, a_{ij} \right] \left[ b_{j}(o_{t}) \right]^{\gamma_{t}} \qquad (3)$$

where φ_(j,t) is the state metric for state j at time t and a_(ij) is the state transition metric. Such weighting has the advantage of becoming a simple multiplication of log(b_(j)(o_(t))) by γ_(t) in the logarithmic domain often used for scaling purposes. Furthermore, note that if one is certain about the received feature, γ_(t)=1 and equation 3 is equivalent to equation 2. On the other hand, if the decoded feature is unreliable, γ_(t)=0 and the probability of observing the feature given the HMM state model b_(j)(o_(t)) is discarded in the VA recursive step.
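
In the log domain this weighting amounts to multiplying each frame's observation log likelihood by γ_(t) before the maximization. A minimal sketch under the same assumed NumPy conventions as the Viterbi example above (with `gamma` a length-T vector of frame reliabilities in [0, 1]) is:

```python
import numpy as np

def weighted_viterbi_score(log_b, log_a, log_pi, gamma):
    """Time-weighted Viterbi recursion of equation (3):
    phi_{j,t} = max_i [ phi_{i,t-1} + log a_ij ] + gamma_t * log b_j(o_t).

    gamma_t = 1 recovers the standard recursion (2); gamma_t = 0 discards
    the observation probability for that frame.
    """
    T, N = log_b.shape
    phi = log_pi + gamma[0] * log_b[0]
    for t in range(1, T):
        scores = phi[:, None] + log_a          # candidate scores for each (i, j)
        phi = np.max(scores, axis=0) + gamma[t] * log_b[t]
    return float(np.max(phi))                  # log P*(O | lambda)
```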

Under the hypothesis of a diagonal covariance matrix Σ, the overall probability b_(j)(o_(t)) can be computed as the product of the probabilities of observing each individual feature. The weighted recursive formula (equation 3) can then include individual weighting factors γ_(k,t) for each of the N_F front-end features:

$$\varphi_{j,t} = \max_{i}\left[ \varphi_{i,t-1}\, a_{ij} \right] \prod_{k=1}^{N_{F}} \left[ b_{j}(o_{t}) \right]_{k}^{\gamma_{k,t}} \qquad (4)$$

where k indicates the dimension of the feature observed.

Time and Frequency WVR

In accordance with the present invention we provide an extension to the time-only weighted recognition presented in equation 3. First, we present how we can use both time and frequency weighting. Second, we present how the weighting coefficients can be obtained.

Time and Frequency Weighting

With time weighting only, the insertion of the weighting coefficient in the overall likelihood computation could be performed after the probability b_(j)(o_(t)) had been computed, by raising it to the power γ_(t), using $\tilde{b}_{j}(o_{t}) = \left[ b_{j}(o_{t}) \right]^{\gamma_{t}}$.

In order to perform time and frequency SNR dependent weighting, we need to change the way the probability b_(j)(o_(t)) is computed. Normally, the probability of observing the N_F-dimensional feature vector o_(t) in the j-th state is computed as follows:

$$b_{j}(o_{t}) = \sum_{m=1}^{N_{M}} w_{m} \frac{1}{\sqrt{(2\pi)^{N_{F}} |\Sigma|}} \, e^{-\frac{1}{2}(o_{t}-\mu)'\,\Sigma^{-1}\,(o_{t}-\mu)}, \qquad (5)$$

where N_M is the number of mixture components, w_m is the mixture weight, and the parameters of the multivariate Gaussian mixture are its mean vector μ and covariance matrix Σ.
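
As a concrete illustration of equation 5 under a diagonal covariance, a small sketch is given below; the parameter names (`weights`, `means`, `variances`) are illustrative only and per-mixture diagonal covariances are assumed.

```python
import numpy as np

def gmm_log_likelihood(o_t, weights, means, variances):
    """log b_j(o_t) of equation (5) for a diagonal-covariance Gaussian mixture.

    o_t       : (N_F,)      observed feature vector
    weights   : (N_M,)      mixture weights w_m
    means     : (N_M, N_F)  mixture mean vectors mu
    variances : (N_M, N_F)  diagonals of the covariance matrices Sigma
    """
    diff = o_t - means
    # per-mixture log Gaussian: -1/2 [ N_F log(2 pi) + sum log var + sum diff^2 / var ]
    log_gauss = -0.5 * (o_t.size * np.log(2.0 * np.pi)
                        + np.sum(np.log(variances), axis=1)
                        + np.sum(diff ** 2 / variances, axis=1))
    log_terms = np.log(weights) + log_gauss
    m = np.max(log_terms)                      # log-sum-exp over the mixtures
    return float(m + np.log(np.sum(np.exp(log_terms - m))))
```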

In order to simplify notation, we simply note that log(b_(j)(o_(t))) is proportional to a weighted sum of the cepstral distance between the observed feature and the cepstral mean (o_(t)−μ), where the weighting coefficients are based on the inverse covariance matrix Σ⁻¹:

$$\log\left( b_{j}(o_{t}) \right) \propto (o_{t}-\mu)'\,\Sigma^{-1}\,(o_{t}-\mu). \qquad (6)$$

Remember that the N_F-dimensional cepstral feature o_(t) is obtained by performing the Discrete Cosine Transform (DCT) on the N_S-dimensional log Mel spectrum S. Mathematically, if the N_F×N_S dimensional matrix M represents the DCT transformation matrix, then we have o_(t) = MS. Reciprocally, we have S = M⁻¹o_(t), where the N_S×N_F matrix M⁻¹ represents the inverse DCT matrix.

Since usually the frequency weighting coefficients we have at hand will be in the log spectral domain (whether a linear or Mel spectrum scale is not important) and not in the cepstral domain, we use the inverse DCT matrix M⁻¹ to transform the cepstral distance (o_(t)−μ) into a spectral distance. Once in the spectral domain, time and frequency weighting can be applied by means of a time-varying diagonal matrix G_(t) which represents the weighting coefficients γ_(t,f),

$$G_{t} = \mathrm{diag}(\gamma_{t,f}) \qquad (7)$$

Finally, once the weighting has been performed, we can go back to the cepstral domain by performing the forward DCT operation. All together, the time and frequency weighting operation on the cepstral distance d = (o_(t)−μ) becomes

$$\tilde{d} = M G_{t} M^{-1}(o_{t} - \mu) \qquad (8)$$
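
A minimal numerical sketch of equations 7 and 8 follows; the orthonormal type-II DCT matrix, its truncation to N_F rows, and the use of a pseudo-inverse for M⁻¹ are assumptions made for illustration, since the exact matrices depend on the front end in use.

```python
import numpy as np

def dct_matrix(n_f, n_s):
    """N_F x N_S DCT matrix M such that o_t = M @ S (orthonormal type-II DCT)."""
    k = np.arange(n_f)[:, None]
    n = np.arange(n_s)[None, :]
    m = np.sqrt(2.0 / n_s) * np.cos(np.pi * k * (2 * n + 1) / (2 * n_s))
    m[0] /= np.sqrt(2.0)
    return m

def weighted_cepstral_distance(o_t, mu, gamma_tf):
    """Equation (8): d_tilde = M G_t M^{-1} (o_t - mu), with G_t = diag(gamma_{t,f})."""
    M = dct_matrix(o_t.size, gamma_tf.size)
    M_inv = np.linalg.pinv(M)            # N_S x N_F (pseudo-)inverse DCT
    T_t = M @ np.diag(gamma_tf) @ M_inv  # back-and-forth weighting matrix T_t, equation (7)
    return T_t @ (o_t - mu), T_t
```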

With this notation, the weighted probability of observing the feature becomes

$$\tilde{b}_{j}(o_{t}) = \sum_{m=1}^{N_{M}} w_{m} \frac{1}{\sqrt{(2\pi)^{N_{F}} |\Sigma|}} \, e^{-\frac{1}{2}(o_{t}-\mu)'\,(M G_{t} M^{-1})'\,\Sigma^{-1}\,(M G_{t} M^{-1})\,(o_{t}-\mu)} \qquad (9)$$

which can be rewritten using a back-and-forth weighted time-varying transformation matrix T_(t) = MG_(t)M⁻¹ as

$$\tilde{b}_{j}(o_{t}) = \sum_{m=1}^{N_{M}} w_{m} \frac{1}{\sqrt{(2\pi)^{N_{F}} |\Sigma|}} \, e^{-\frac{1}{2}(o_{t}-\mu)'\,T_{t}'\,\Sigma^{-1}\,T_{t}\,(o_{t}-\mu)}, \qquad (10)$$

which also resembles the unweighted equation 5 with a new inverse covariance matrix:

$$\tilde{\Sigma}^{-1} = T_{t}'\,\Sigma^{-1}\,T_{t}, \qquad \tilde{b}_{j}(o_{t}) = \sum_{m=1}^{N_{M}} w_{m} \frac{1}{\sqrt{(2\pi)^{N_{F}} |\Sigma|}} \, e^{-\frac{1}{2}(o_{t}-\mu)'\,\tilde{\Sigma}^{-1}\,(o_{t}-\mu)} \qquad (11)$$
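
Continuing the sketch, the weighted likelihood of equations 10 and 11 can be computed by replacing Σ⁻¹ with T_(t)′Σ⁻¹T_(t) inside each mixture term; the explicit full-matrix quadratic form below (with diagonal per-mixture covariances, as assumed earlier) is only one possible way to organize the computation.

```python
import numpy as np

def weighted_gmm_log_likelihood(o_t, weights, means, variances, T_t):
    """log b~_j(o_t) of equation (11), with Sigma~^{-1} = T_t' Sigma^{-1} T_t."""
    n_f = o_t.size
    log_terms = []
    for w_m, mu_m, var_m in zip(weights, means, variances):
        d = o_t - mu_m
        sigma_tilde_inv = T_t.T @ np.diag(1.0 / var_m) @ T_t    # equation (11)
        quad = d @ sigma_tilde_inv @ d
        # the normalization keeps the original |Sigma|, as in equations (9)-(11)
        log_norm = -0.5 * (n_f * np.log(2.0 * np.pi) + np.sum(np.log(var_m)))
        log_terms.append(np.log(w_m) + log_norm - 0.5 * quad)
    log_terms = np.array(log_terms)
    m = np.max(log_terms)                      # log-sum-exp over the mixtures
    return float(m + np.log(np.sum(np.exp(log_terms - m))))
```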

To conclude this part on time and frequency weighting, note that time weighting only is a special case of time and frequency weighting where G_(t) = γ_(t)·I, where I is the identity matrix, which also means that the weighting is the same for all frequencies.

Determining the Weighting Coefficients

In order to have the system perform SNR dependent decoding, we first need a time and frequency SNR evaluation. In the special case presented above, the time scale is frame based (every 10 ms) and the frequency scale is the Mel frequency scale, which divides the narrowband speech spectrum (0-4 kHz) into 25 non-uniform bandwidth frequency bins.

In that specific case, the time and frequency SNR evaluation we are using for the purpose of evaluating the presented technique is that of the ETSI Distributed Speech Recognition standard, which evaluates the SNR in the time and frequency domain for spectral subtraction purposes. See ETSI STQ-Aurora DSR Working Group, "Extended Advanced Front-End (XAFE) Algorithm Description," Tech. Rep., ETSI, March 2003.

Regardless of the technique used to obtain such a time and frequency dependent SNR estimate, we will refer to such value as η_(t,f), the SNR at frequency f and time t. The weighting coefficient γ_(t,f) can be obtained by performing any function which will monotonically map the values taken by the SNR evaluation (logarithmic or linear) to the interval [0,1] of the values that can be taken by the weighting coefficients γ_(t,f). In other words, we have

$$\gamma_{t,f} = f(\eta_{t,f}) \qquad (12)$$

One particular instantiation of equation 12 uses a Wiener filter type equation applied on the linear SNR estimate to obtain

$$\gamma_{t,f} = \frac{\sqrt{\eta_{t,f}}}{1 + \sqrt{\eta_{t,f}}},$$

which guarantees that γ_(t,f) is equal to 0 when η_(t,f) = 0 and γ_(t,f) approaches 1 when η_(t,f) is large.
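
A one-line sketch of this Wiener-type mapping (assuming `snr_linear` holds the linear, i.e. non-dB, SNR estimates η_(t,f) as a NumPy array) is:

```python
import numpy as np

def snr_to_weight(snr_linear):
    """gamma_{t,f} = sqrt(eta_{t,f}) / (1 + sqrt(eta_{t,f})):
    equals 0 when eta = 0 and approaches 1 as eta grows large."""
    root = np.sqrt(np.maximum(snr_linear, 0.0))  # guard against negative estimates
    return root / (1.0 + root)
```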

FIG. 2 illustrates the block diagram for the time and frequency weighted Viterbi recognition algorithm. For each incoming speech frame t, the first step 21 is to estimate the SNR to get η_(t,f). Then the weighting is calculated to get γ_(t,f) at step 23. Then the transform matrix computation at step 25 is performed; this is the MG_(t)M⁻¹ computation that yields T_(t). The next step is Viterbi decoding at step 27 to get b_(j)(o_(t)). Here the original MFCC feature o_(t) is sent to the recognizer; the original feature contains the information about the SNR.
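
Tying the pieces together, a hypothetical per-frame driver matching FIG. 2 might look as follows, reusing the helper functions sketched above; the SNR estimate η_(t,f) is taken as an input because FIG. 2 assumes it is supplied by a separate front-end estimator (e.g. the ETSI XAFE noise estimation), and holding the per-state GMM parameters in a list of dictionaries is purely an illustrative choice.

```python
import numpy as np

def process_frame(o_t, snr_tf, gmm_states):
    """One pass of the FIG. 2 pipeline for a single speech frame t:
    steps 21/23 map the SNR estimate to weights gamma_{t,f}, step 25 builds
    T_t = M G_t M^{-1}, and the weighted per-state scores log b~_j(o_t) are
    returned for the Viterbi decoding of step 27.
    """
    gamma_tf = snr_to_weight(snr_tf)                 # steps 21 and 23
    M = dct_matrix(o_t.size, gamma_tf.size)
    T_t = M @ np.diag(gamma_tf) @ np.linalg.pinv(M)  # step 25
    return np.array([                                # inputs to step 27
        weighted_gmm_log_likelihood(o_t, s["weights"], s["means"], s["variances"], T_t)
        for s in gmm_states
    ])
```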

Performance Evaluation

Experimental Conditions

We used the standard Aurora-2 testing procedure, which averages recognition performance over 10 different noise conditions (two with channel mismatch in Test C) at 5 different SNR levels (20 dB, 15 dB, 10 dB, 5 dB and 0 dB).

As a reminder, performance is established using the following configuration: a 21-dimensional feature vector (16 Mel frequency cepstral coefficient (MFCC) features with 1st order derivatives) extracted every 10 ms, and 16-state word HMM models with 20 Gaussian mixtures per state.

Performance of Time-WVR Algorithm

FIG. 3 summarizes the performance of the time-WVR algorithm on the Aurora-2 database. As expected, the t-WVR algorithm improves recognition accuracies mainly in the medium SNR range. Indeed, it is in the medium SNR range that the distinction between frames that can be obtained by performing SNR dependent weighting is the most useful. At low (respectively high) SNR, most features are already usually bad (respectively good).

In accordance with the present invention the weighting function can be applied in the logarithmic domain using a simple multiplicative operation. The weighting coefficient can be the output of many different importance estimation mechanisms, including a frame SNR estimation, a pronunciation probability estimation, a reliability estimation for transmission over a noisy communication channel, etc.

Although preferred embodiments have been described, it will be apparent to those skilled in the art that various modifications, additions, substitutions and the like can be made without departing from the spirit of the invention, and these are therefore considered to be within the scope of the invention as defined in the following claims.

CLAIMS

1. A method for performing time and frequency SNR dependent weighting in speech recognition comprising the steps of: for each speech frame t estimating the SNR to get time and frequency SNR information η_(t,f); calculating the time and frequency weighting to get γ_(t,f); performing the back and forth weighted time varying DCT transformation matrix computation MG_(t)M⁻¹ to get T_(t); providing the transformation matrix computation T_(t) and the original MFCC feature o_(t) that contains the information about the SNR to a recognizer including the Viterbi decoding; and performing weighted Viterbi recognition b_(j)(o_(t)).

2. The method of claim 1 wherein

$$\gamma_{t,f} = \frac{\sqrt{\eta_{t,f}}}{1 + \sqrt{\eta_{t,f}}},$$

which guarantees that γ_(t,f) is equal to 0 when η_(t,f) = 0 and γ_(t,f) approaches 1 when η_(t,f) is large.

3. A method for performing time and frequency SNR dependent weighting in speech recognition comprising the steps of: for each period t estimating the SNR to get time and frequency SNR information η_(t,f); calculating the time and frequency weighting to get γ_(t,f); performing the back and forth weighted time varying DCT transformation matrix computation MG_(t)M⁻¹ to get T_(t); providing the transformation matrix computation T_(t) and the original MFCC feature o_(t) that contains the information about the SNR to a recognizer including the Viterbi decoding; and performing weighted Viterbi recognition b_(j)(o_(t)).

4. The method of claim 3 wherein said estimating step is a pronunciation probability estimation step.

5. The method of claim 3 wherein said estimating step is a transmission over a noisy communication channel reliability estimation.

6. The method of claim 3 wherein

$$\gamma_{t,f} = \frac{\sqrt{\eta_{t,f}}}{1 + \sqrt{\eta_{t,f}}},$$

which guarantees that γ_(t,f) is equal to 0 when η_(t,f) = 0 and γ_(t,f) approaches 1 when η_(t,f) is large.